Validity and social dimension of language testing yarmohammadi
'''Cronbach'''
It is only a slight overstatement to call Cronbach the “father” of construct validity. Construct validity is ordinarily studied when the tester has no definite criterion measure of the quality with which he [sic] is concerned and must use indirect measures. Here the trait or quality underlying the test is of central importance, rather than either the test behavior or the scores on the criteria. Note that the language of “trait” and “underlying quality” frames the target of validation in the language of individuality and cognition. One of the major developments in validation research since Cronbach and Meehl’s article is the increasingly central role taken by construct validity, which has subsumed other types of validity and validation. There is also clear recognition that validity is not a mathematical property like discrimination or reliability, but a matter of judgment. Cronbach emphasized the need for a validity argument, which focuses on collecting evidence for or against a certain interpretation of test scores. In other words, it is the validity of inferences that construct validation work is concerned with, rather than the validity of instruments. In fact, Cronbach argued that there is no such thing as a “valid test,” only more or less defensible interpretations: “One does not validate a test, but only a principle for making inferences”; “One validates not a test, but an interpretation of data arising from a specified procedure.” Cronbach and Meehl distinguished between a weak and a strong program for construct validation. The weak one is a fairly haphazard collection of any sort of evidence that supports the particular interpretation to be validated. It is in fact a highly unprincipled attempt at verification by any means available. In contrast, the strong program is based on the falsification idea advanced by Popperian philosophy: rival hypotheses for interpretations are proposed and logically or empirically examined.
Cronbach admitted that the actual approach taken in most validation research is “confirmationist” rather than “falsificationist” (that is, aimed at refuting rival hypotheses), as researchers and test designers try to prove the worth of their instruments. In his most influential writings on validation within measurement, Cronbach never stressed the importance of the sociopolitical context and its influence on the whole testing enterprise; this is in marked contrast to his work on program evaluation, in which he strongly emphasized that evaluations are sites of political conflict and clashes of values. However, in his later writings, possibly through his experiences in program evaluation, Cronbach highlighted the role of beliefs and values in validity arguments, which “must link concepts, evidence, social and personal consequences, and values.” He acknowledged that all interpretation involves questions of values: A persuasive defense of an interpretation will have to combine evidence, logic, and rhetoric. What is persuasive depends on the beliefs in the community. And he concurred with Messick that validity work has an obligation to consider test consequences and help to prevent negative ones. Cronbach also recognized that judgments of positive or negative consequences depend on societal views of what is a desirable consequence, but that these views and values change over time. What we have here, then, is a concern for social consequences as a kind of corrective to an earlier, entirely cognitive and individualistic way of thinking about tests. Cronbach’s difficulty in integrating his psychometrically inspired work on construct validity, originating within educational psychology, and his concern for social and political values, originating in his work in the policy-oriented field of program evaluation, has remained characteristic of both the fields of educational measurement and language testing, as we will see.
'''Messick'''
Messick, like Cronbach, saw assessment as a process of reasoning and evidence gathering carried out in order for inferences to be made about individuals and saw the task of establishing the meaningfulness and defensibility of those inferences as being the primary task of assessment development and research. This reflects an individualist, psychological tradition of measurement concerned with fairness. He introduced the social more explicitly into this picture by arguing two things: that our conceptions of what it is that we are measuring, and the things we prioritize in measurement, will reflect values, which we can assume will be social and cultural in origin; and that tests have real effects in the educational and social contexts in which they are used and that these need to be matters of concern for those responsible for the test. Messick saw these aspects of validity as holding together within a unified theory of validity.
'''Messick on Construct Validity'''
Those claims then provide the rationale for making decisions about individuals on the basis of test scores. Supporting the adequacy of claims about an individual with reasoning and evidence and demonstrating the relevance of the claims to the decisions we wish to make are fundamental to the process. Because of its centrality to assessment, and its complexity, this aspect of validation continues to attract considerable theoretical and practical attention, to the point where, for some authors, it becomes synonymous with validity, as we have seen. However, you might say, what about situations in which students and health professionals are given temporary admission to the setting in question and are placed under observation? Is not the observation a form of direct inspection?
However, even in these cases, we cannot observe the student in all contexts of interest and must sample from them, acknowledging also that if the student knows he or she is being observed, his or her behavior might be modified accordingly. We cannot tell the extent of this modification, as we have no benchmark of unobserved behavior against which we can measure the observed behavior. In principle, then, and in fact, how the person will fare in the target setting cannot be known directly, but must be predicted. Deciding whether the person should be admitted then depends on two prior steps: modeling what you believe the demands of the target setting are likely to be and predicting what the standing of the individual is in relation to this construct. Clearly, then, both the construct and our view of the individual’s standing are matters of belief and opinion, and each must be supported with reasoning and evidence before a defensible decision about the individual can be made. The test is a procedure for gathering evidence in support of decisions that need to be made and interpreting that evidence carefully. It involves making some observations and then interpreting them in the light of certain assumptions about the requirements of the target setting and the relationship of the evidence to those assumptions. Tests and assessments thus represent systematic approaches to constraining these inferential processes in the interests of guaranteeing their fairness or validity. Validity therefore implies considerations of social responsibility, both to the candidate and to the receiving institution and those whose quality of health care will be a function in part of the adequacy of the candidate’s communicative skill.
Fairness in this sense can only be achieved through carefully planning the design of the observations of candidate performance and carefully articulating the relationship between the evidence we gain from test performance and the inferences about candidate standing that we wish to make from it. Test validation steers between the Scylla and Charybdis of what Messick called construct underrepresentation, on the one hand, and construct-irrelevant variance, on the other. The former warns of the danger that the assessment requires less of the test taker than is required in reality. We will give examples of this later. The latter warns that differences in scores might not be due only to differences in the ability being measured but that other factors are illegitimately affecting scores.
'''Construct Definition and Validation: Mislevy'''
The work of Mislevy and his colleagues provides analytic clarity to the procedures involved in designing tests. As we have seen, central to assessment is the chain of reasoning from the observations to the claims we make about test takers, on which the decisions about them will be based. Mislevy calls this the “assessment argument.” This is needed to establish the relevance of assessment data and its value as evidence. According to Mislevy, “An assessment is a machine for reasoning about what students know, can do, or have accomplished, based on a handful of things they say, do, or make in particular settings.” Mislevy has developed an approach called Evidence Centered Design. A preliminary first stage, Domain Analysis, involves what in performance assessment is traditionally called job analysis. Here the test developer needs to develop insight into the conceptual and organizational structure of the target domain. All three steps precede the actual writing of specifications for test tasks; they constitute the “thinking stage” of test design. Only when this chain of reasoning is completed can the specifications for test tasks be written.
In further stages of Evidence Centered Design, Mislevy goes on to deal with turning the conceptual framework developed in the domain modeling stage into an actual assessment and ensuring that the psychometric properties of the test provide evidence in support of the ultimate claims that we wish to make about test takers. He proposes a series of statistical models in which test data are analyzed to support (or challenge) the logic of the assessment. The discussion here is highly technical and deals in particular with the requirements in terms of measurement, scoring, and logistics for an assessment to implement the relationships in the domain modeling. The final outcome of this is an operational assessment. Mislevy’s conceptual analysis is impressive. Note, however, that its consideration of the social dimension of assessment remains implicit and limited to issues of fairness. For example, Mislevy does not consider the context in which tests are commissioned and, thus, cannot problematize the determination of test constructs as a function of their role in the social and policy environment. As we will see in chapter 7, the conceptualizations on which assessments are ultimately based might be determined from outside as part of a policy that might not involve input from applied linguists or experts on language learning at all. Nor does Mislevy deal directly with the uses of test scores, the decisions for which they form the basis, except insofar as they determine the formulation of relevant claims, which, in any case, is taken as a given and is not problematized.
'''Kane’s Approach to Test Score Validation'''
Kane has also developed a systematic approach to thinking through the process of drawing valid inferences from test scores. Kane points out that we interpret scores as having meaning. The same score might have different interpretations.
For example, scores on a test consisting of reading comprehension passages might be interpreted in very different ways, such as “a measure of skill at answering passage-related questions,” “a measure of reading comprehension defined more broadly,” “one indicator of verbal aptitude,” or “an indicator of some more general construct, such as intelligence.” Whatever the interpretation we choose, Kane argues, we need an argument to defend the relationship of the score to that interpretation. He calls this an interpretative ''argument'', defined as a “chain of inferences from the observed performances to conclusions ''and decisions'' included in the interpretation.” The second inference is from the observed score to what Kane called the universe score, deliberately using terminology from generalizability theory. This inference is that the observed score is consistent across tasks, judges, and occasions. This involves the traditional issue of reliability and can be studied effectively using generalizability theory, item response modeling, and other psychometric techniques. A number of kinds of variable typical of performance assessments in language can threaten the validity of this inference, including raters, task, rating scale, candidate characteristics, and interactions among these. Kane pointed out that generalization across tasks is often poor in complex performance assessments: the fact that a person can handle a complex writing task involving one topic and supporting stimulus material does not necessarily mean that the person will perform in a comparable way on another topic and another set of materials. If task generalizability is weak, or if the impact of raters or the rating process is large, then this “bridge” collapses and we cannot move on in the interpretative argument. For Kane, an interpretative argument is only as strong as its weakest link.
In performance-based language tests, the impact of such factors as task, rater, interlocutor, and so on has been consistently shown to be considerable, and an aspect of the social responsibility of testers is to report efforts to estimate the impact of these factors and to control them, for example, through double rating. Here, language testing is like science: Consistent with the broader nomothetic view of science from which it arises, the goal of psychometrics is to develop interpretations that are generalizable across individuals and contexts and to understand the limits of those generalizations. Core methods in psychometrics involve standardized procedures for data collection . . . . These methods promise replicability and experimental or statistical control of factors deemed irrelevant or ancillary to the variable under study . . . . Meaningful interpretations are to be found in the patterning of many small, standardized observations within and across individuals.
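The inference from observed score to universe score can be made concrete with a small generalizability study. What follows is only a minimal sketch, not a procedure taken from Kane or from the text above: it assumes a fully crossed persons-by-raters design with one rating per cell, uses the standard expected-mean-squares estimates of the variance components, and all function names and the sample ratings are hypothetical.

```python
from statistics import mean


def g_study(scores):
    """Estimate variance components for a crossed persons x raters design.

    scores[p][r] is the rating given to person p by rater r.
    Assumes every rater rates every person exactly once (illustrative design).
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    person_means = [mean(row) for row in scores]
    rater_means = [mean(scores[p][r] for p in range(n_p)) for r in range(n_r)]

    # Mean squares for persons, raters, and the residual
    # (person-by-rater interaction confounded with error).
    ms_p = n_r * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
    ms_res = sum(
        (scores[p][r] - person_means[p] - rater_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    ) / ((n_p - 1) * (n_r - 1))

    # Negative estimates are truncated to zero, a common convention.
    return {
        "person": max((ms_p - ms_res) / n_r, 0.0),
        "rater": max((ms_r - ms_res) / n_p, 0.0),
        "residual": ms_res,
    }


def g_coefficient(components, n_raters):
    """Relative generalizability coefficient when averaging over n_raters."""
    relative_error = components["residual"] / n_raters
    return components["person"] / (components["person"] + relative_error)


# Hypothetical ratings: four candidates, each scored by the same two raters.
ratings = [[4, 5], [2, 3], [6, 6], [3, 5]]
components = g_study(ratings)
print(components)
print(round(g_coefficient(components, 1), 3))  # single rating
print(round(g_coefficient(components, 2), 3))  # double rating
```

In this sketch the coefficient rises when the number of raters goes from one to two, which is the rationale for double rating: averaging over more raters shrinks the error attached to the candidate's score. A real study would usually add further facets such as task and occasion.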
The third type of inference, from the universe score to the target score, is that commonly dealt with under the heading of construct validity and is closest to the first cell of Messick’s validity matrix. This inference involves ''extrapolation'' to nontest behavior, in some cases via ''explanation'' in terms of a model. We have seen how complex it is to establish the relationship between test and nontest behavior, a relationship which goes to the heart of tests and which is the basis for Mislevy’s elaboration of what is involved. The fourth type of inference, from the target score to the decision based on the score, moves the test into the world of test use and test context; it encompasses the material in the second, third, and fourth cells of Messick’s matrix. For Kane, we might, on the one hand, evaluate the quality of interpretations of test scores without reference to the context of the use of the test. He calls such interpretations descriptive, and he calls the inferences that they involve semantic.
When, on the other hand, we cross the bridge into actual test use, we are engaged in decision-based interpretations and policy inferences.